
MML-coder
Collaborator

The SyntheticTextItemsGenerator was generating prompts that could trigger vLLM's automatic prefix caching, producing prefix cache hit rates of up to 80% in some cases during performance benchmarking.

This PR implements unique prefix injection to guarantee a 0% prefix cache hit rate while maintaining realistic prompt characteristics.
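To illustrate the idea, here is a minimal sketch of unique prefix injection on raw token IDs. It is not the code from this PR: the function name `unique_prefix_prompts`, its parameters, and the counter-based scheme are all assumptions made for illustration. The reasoning is that a prefix cache can only reuse work for prompts that share their leading tokens, so giving every prompt a distinct first token should prevent hits as long as the counter has not wrapped around the vocabulary.

```python
import itertools
import random
from typing import Iterator, List


def unique_prefix_prompts(
    vocab_size: int, prompt_len: int, seed: int = 0
) -> Iterator[List[int]]:
    """Yield token-ID prompts whose first token differs for every request.

    Hypothetical helper for illustration only; not guidellm's actual API.
    """
    if prompt_len < 1:
        raise ValueError("prompt_len must be at least 1")
    rng = random.Random(seed)
    for request_index in itertools.count():
        # Assumed scheme: derive the leading token from a running counter,
        # so no two prompts share a prefix until the counter wraps.
        unique_token = request_index % vocab_size
        # The remaining tokens are random filler standing in for synthetic text.
        body = [rng.randrange(vocab_size) for _ in range(prompt_len - 1)]
        yield [unique_token] + body


# Usage: draw 100 prompts of 8 tokens each from a 1000-token vocabulary.
prompts = list(
    itertools.islice(unique_prefix_prompts(vocab_size=1000, prompt_len=8), 100)
)
```

Only one token per prompt is spent on uniqueness, so the requested prompt length and the rest of the token distribution are left unchanged.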

Test:
Tests on the H200 target accelerator are in progress to confirm the fix.

@MML-coder MML-coder marked this pull request as ready for review July 8, 2025 18:58
@MML-coder
Collaborator Author

I am trying to figure out the lint errors. When I run the checks locally, they all pass. :)

ruff check --fix tests/unit/dataset/test_synthetic.py
All checks passed!

@MML-coder
Collaborator Author

End-to-end test:

Ran the following command against an inference server running Llama.

Command:
```
guidellm benchmark \
  --target 'http://llama-4-maverick-fp8-c94dbf44-predictor.kserve-e2e-perf.svc.cluster.local:8080/v1' \
  --model RedHatAI/Llama-4-Maverick-17B-128E-Instruct-FP8 \
  --processor RedHatAI/Llama-4-Maverick-17B-128E-Instruct-FP8 \
  --data='{"prompt_tokens":512,"prompt_tokens_stdev":128,"prompt_tokens_min":1,"prompt_tokens_max":1024,"output_tokens":2048,"output_tokens_stdev":64,"output_tokens_min":1,"output_tokens_max":4096}' \
  --rate-type concurrent \
  --rate "100" \
  --warmup-percent 0.2 \
  --max-requests 500 \
  --output-path output.json
```

vLLM output:
INFO 07-08 17:56:44 [loggers.py:116] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1809.3 tokens/s, Running: 100 reqs, Waiting: 0 reqs, GPU KV cache usage: 6.5%, Prefix cache hit rate: 0.0%
INFO 07-08 17:56:54 [loggers.py:116] Engine 000: Avg prompt throughput: 121.7 tokens/s, Avg generation throughput: 1689.8 tokens/s, Running: 99 reqs, Waiting: 0 reqs, GPU KV cache usage: 6.7%, Prefix cache hit rate: 0.0%
INFO 07-08 17:57:04 [loggers.py:116] Engine 000: Avg prompt throughput: 1136.3 tokens/s, Avg generation throughput: 1267.3 tokens/s, Running: 100 reqs, Waiting: 0 reqs, GPU KV cache usage: 5.9%, Prefix cache hit rate: 0.0%
INFO 07-08 17:57:14 [loggers.py:116] Engine 000: Avg prompt throughput: 1584.5 tokens/s, Avg generation throughput: 1106.8 tokens/s, Running: 99 reqs, Waiting: 0 reqs, GPU KV cache usage: 4.2%, Prefix cache hit rate: 0.0%
INFO 07-08 17:57:24 [loggers.py:116] Engine 000: Avg prompt throughput: 1471.5 tokens/s, Avg generation throughput: 1096.7 tokens/s, Running: 98 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.5%, Prefix cache hit rate: 0.0%
INFO 07-08 17:57:34 [loggers.py:116] Engine 000: Avg prompt throughput: 611.2 tokens/s, Avg generation throughput: 1518.6 tokens/s, Running: 100 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.4%, Prefix cache hit rate: 0.0%
INFO 07-08 17:57:44 [loggers.py:116] Engine 000: Avg prompt throughput: 52.7 tokens/s, Avg generation throughput: 1629.9 tokens/s, Running: 100 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.8%, Prefix cache hit rate: 0.0%
INFO 07-08 17:57:54 [loggers.py:116] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1759.5 tokens/s, Running: 100 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.3%, Prefix cache hit rate: 0.0%
INFO 07-08 17:58:04 [loggers.py:116] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1769.4 tokens/s, Running: 100 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.9%, Prefix cache hit rate: 0.0%
INFO 07-08 17:58:14 [loggers.py:116] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1769.2 tokens/s, Running: 100 reqs, Waiting: 0 reqs, GPU KV cache usage: 4.4%, Prefix cache hit rate: 0.0%
INFO 07-08 17:58:24 [loggers.py:116] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1789.4 tokens/s, Running: 100 reqs, Waiting: 0 reqs, GPU KV cache usage: 4.9%, Prefix cache hit rate: 0.0%
INFO 07-08 17:58:34 [loggers.py:116] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1799.9 tokens/s, Running: 100 reqs, Waiting: 0 reqs, GPU KV cache usage: 5.4%, Prefix cache hit rate: 0.0%
INFO 07-08 17:58:44 [loggers.py:116] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1839.6 tokens/s, Running: 100 reqs, Waiting: 0 reqs, GPU KV cache usage: 6.0%, Prefix cache hit rate: 0.0%
INFO 07-08 17:58:54 [loggers.py:116] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1829.2 tokens/s, Running: 100 reqs, Waiting: 0 reqs, GPU KV cache usage: 6.5%, Prefix cache hit rate: 0.0%
INFO 07-08 17:59:04 [loggers.py:116] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1724.9 tokens/s, Running: 92 reqs, Waiting: 0 reqs, GPU KV cache usage: 6.4%, Prefix cache hit rate: 0.0%
INFO 07-08 17:59:14 [loggers.py:116] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1309.3 tokens/s, Running: 46 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.3%, Prefix cache hit rate: 0.0%
INFO 07-08 17:59:24 [loggers.py:116] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 426.8 tokens/s, Running: 4 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.3%, Prefix cache hit rate: 0.0%
INFO 07-08 17:59:34 [loggers.py:116] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 19.2 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO 07-08 17:59:44 [loggers.py:116] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
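For longer runs, the hit rate in this output could be checked with a script instead of by eye. The sketch below is one possible way to do that, assuming the log keeps the `Prefix cache hit rate: N%` wording shown above. The helper name `prefix_cache_hit_rates` is hypothetical, and the sample text is two entries copied from this log.

```python
import re
from typing import List

# Matches the percentage in "Prefix cache hit rate: 0.0%".
HIT_RATE_RE = re.compile(r"Prefix cache hit rate: (\d+(?:\.\d+)?)%")


def prefix_cache_hit_rates(log_text: str) -> List[float]:
    """Return every prefix cache hit rate found in the log text, in order."""
    return [float(value) for value in HIT_RATE_RE.findall(log_text)]


# Two entries copied from the vLLM output above.
sample_log = (
    "INFO 07-08 17:57:04 [loggers.py:116] Engine 000: "
    "Avg prompt throughput: 1136.3 tokens/s, "
    "Avg generation throughput: 1267.3 tokens/s, "
    "Running: 100 reqs, Waiting: 0 reqs, "
    "GPU KV cache usage: 5.9%, Prefix cache hit rate: 0.0%\n"
    "INFO 07-08 17:57:14 [loggers.py:116] Engine 000: "
    "Avg prompt throughput: 1584.5 tokens/s, "
    "Avg generation throughput: 1106.8 tokens/s, "
    "Running: 99 reqs, Waiting: 0 reqs, "
    "GPU KV cache usage: 4.2%, Prefix cache hit rate: 0.0%\n"
)

rates = prefix_cache_hit_rates(sample_log)
# The fix is working if no interval reports a non-zero hit rate.
all_zero = bool(rates) and max(rates) == 0.0
```

The pattern is anchored on the metric name, so the `GPU KV cache usage` percentage on the same line is not picked up.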

📦 Build Artifacts Available
The build artifacts (.whl and .tar.gz) have been successfully generated and are available for download: https://github.com/neuralmagic/guidellm/actions/runs/16178342451/artifacts/3498543434.
They will be retained for up to 30 days.

@MML-coder
Collaborator Author

pre-commit run --all-files
trim trailing whitespace.................................................Passed
fix end of files.........................................................Passed
run linter...............................................................Passed
run formatter............................................................Passed
mypy.....................................................................Passed

@markurtz markurtz added this to the v0.3.0 milestone Aug 13, 2025
@sjmonson sjmonson assigned sjmonson and unassigned MML-coder Aug 14, 2025
@sjmonson sjmonson force-pushed the prefix_cache_invalidate branch 2 times, most recently from ca35625 to 6662be6 Compare August 14, 2025 20:36
MML-coder and others added 2 commits August 18, 2025 16:32
Co-authored-by: Mehul <[email protected]>
Co-authored-by: Samuel Monson <[email protected]>
Signed-off-by: Samuel Monson <[email protected]>
@sjmonson sjmonson force-pushed the prefix_cache_invalidate branch from 6662be6 to da29a71 Compare August 18, 2025 20:38
@sjmonson
Collaborator

Merging work into #183

@sjmonson sjmonson closed this Aug 18, 2025
@sjmonson sjmonson deleted the prefix_cache_invalidate branch August 18, 2025 20:44